13 research outputs found

    On the co-design of scientific applications and long vector architectures

    Get PDF
    The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. To improve the efficiency of next-generation compute devices, architects are looking for solutions beyond the commodity CPU approach. In 2021, the five most powerful supercomputers in the world use either GP-GPU (general-purpose computing on graphics processing units) accelerators or a customized CPU specially designed to target HPC applications. This trend is only expected to grow in the coming years, motivated by the compute demands of science and industry. As architectures evolve, the ecosystem of tools and applications must follow. The choices in the number of cores in a socket, the floating-point units per core and the bandwidth through the memory hierarchy, among others, have a large impact on the power consumption and compute capabilities of the devices. To balance CPUs and accelerators, designers require accurate tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. In such a large design space, capturing and modeling with simulators the complex interactions between the system software and hardware components is a daunting challenge. Moreover, applications must be able to exploit those designs with aggressive compute capabilities and memory bandwidth configurations. Algorithms and data structures will need to be redesigned accordingly to expose a high degree of data-level parallelism, allowing them to scale on large systems. Therefore, next-generation computing devices will be the result of a co-design effort in hardware and applications supported by advanced simulation tools. In this thesis, we focus our work on the co-design of scientific applications and long vector architectures. We significantly extend a multi-scale simulation toolchain, enabling accurate performance and power estimations of large-scale HPC systems. Through simulation, we explore the large design space of current HPC trends over a wide range of applications. We extract speedup and energy consumption figures, analyzing the trade-offs and optimal configurations for each of the applications. We describe in detail the optimization process of two challenging applications on real vector accelerators, achieving outstanding operation performance and full memory bandwidth utilization. Overall, we provide evidence-based architectural and programming recommendations that will serve as hardware and software co-design guidelines for the next generation of specialized compute devices.

    Evaluation of low-power architectures in a scientific computing environment

    Get PDF
    HPC (High Performance Computing) represents, together with theory and experiments, the third pillar of science. Through HPC, scientists can simulate phenomena otherwise impossible to study. The need to perform larger and more accurate simulations requires HPC to improve every day. HPC is constantly looking for new computational platforms that can improve cost and power efficiency. The Mont-Blanc project is an EU-funded research project that studies new hardware and software solutions that can improve the efficiency of HPC systems. The vision of the project is to leverage the fast-growing market of mobile devices to develop the next generation of supercomputers. In this work we contribute to the objectives of the Mont-Blanc project by evaluating the performance of production scientific applications on innovative low-power architectures. To do so, we describe our experiences porting and evaluating state-of-the-art scientific applications on the Mont-Blanc prototype, the first HPC system built with commodity low-power embedded technology. We then extend our study to compare off-the-shelf ARMv8 platforms. Finally, we discuss the most significant issues encountered during the development of the Mont-Blanc prototype system.

    Optimizing sparse matrix-vector multiplication in NEC SX-Aurora vector engine

    Get PDF
    Sparse Matrix-Vector multiplication (SpMV) is an essential piece of code used in many High Performance Computing (HPC) applications. As previous literature shows, achieving efficient vectorization and performance in modern multi-core systems is far from straightforward. It is important, then, to revisit the current state-of-the-art matrix formats and optimizations to be able to deliver high performance on long vector architectures. In this tech report, we describe how to develop an efficient implementation that achieves high throughput on the NEC Vector Engine: a 256-element-long vector architecture. Combining several pre-processing and kernel optimizations, we obtain an average 12% improvement over a base SELL-C-s implementation on a heterogeneous set of 24 matrices.
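    The pre-processing mentioned above can be illustrated with a short sketch. In SELL-C-s style formats, rows are sorted by nonzero count inside windows of s (sigma) rows, so that the chunks of C consecutive rows built afterwards contain rows of similar length and need little zero padding. The following C code is a minimal sketch under that assumption, not the implementation from the tech report; sigma_sort, row_len and perm are illustrative names.

        #include <stdlib.h>

        typedef struct { int row; int len; } row_info;

        /* Order rows by descending nonzero count. */
        static int by_len_desc(const void *a, const void *b)
        {
            return ((const row_info *)b)->len - ((const row_info *)a)->len;
        }

        /* Sort rows by length inside windows of sigma rows; perm[] receives
           the resulting row order. Sorting only within a window keeps rows
           close to their original position, which limits how far accesses to
           the source and destination vectors are scattered. */
        void sigma_sort(const int *row_len, int n_rows, int sigma, int *perm)
        {
            row_info *buf = malloc((size_t)sigma * sizeof *buf);
            for (int w = 0; w < n_rows; w += sigma) {
                int cnt = (w + sigma <= n_rows) ? sigma : n_rows - w;
                for (int i = 0; i < cnt; ++i) {
                    buf[i].row = w + i;
                    buf[i].len = row_len[w + i];
                }
                qsort(buf, (size_t)cnt, sizeof *buf, by_len_desc);
                for (int i = 0; i < cnt; ++i)
                    perm[w + i] = buf[i].row;
            }
            free(buf);
        }

    A larger sigma reduces padding (rows in a chunk are more uniform) at the cost of a stronger row permutation, which is the trade-off such formats tune per matrix.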

    Resultados de una encuesta sobre el soporte nutricional perioperatorio en la cirugía pancreática y biliar en España

    Full text link
    Introduction: a survey on peri-operative nutritional support in pancreatic and biliary surgery among Spanish hospitals in 2007 showed that few surgical groups followed the 2006 ESPEN guidelines. Ten years later, we sent a questionnaire to check the current situation. Methods: a questionnaire with 21 items was sent to 38 centers, covering fasting time before and after surgery, nutritional screening use and type, time and type of peri-operative nutritional support, and number of procedures. Results: thirty-four institutions responded. The median number of pancreatic resections (head/total) was 29.5 (95% CI: 23.0-35; range, 5-68) (total, 1002); of surgeries for biliary malignancies (non-pancreatic), 9.8 (95% CI: 7.3-12.4; range, 2-30); and of main biliary resections for benign conditions, 10.4 (95% CI: 7.6-13.3; range, 2-33). Before surgery, only 41.2% of the sites used nutritional support (< 50% used any nutritional screening procedure). The mean duration of preoperative fasting was 9.3 h for solid foods (range, 6-24 h) and 6.6 h for liquids (range, 2-12 h). Following pancreatic surgery, 29.4% tried to use early oral feeding, but 88.2% of the surveyed teams used some nutritional support; 26.5% of respondents used TPN in 100% of cases. The other centers used TPN and EN in varying proportions. In malignant biliary surgery, 22.6% always used TPN, and 19.3% used EN. Conclusions: TPN is the commonest nutritional approach after pancreatic head surgery. Only 29.4% of the units used early oral feeding, and 32.3% used EN; 22.6% used TPN regularly after surgery for malignant biliary tumours. In our country, the 2006 ESPEN guideline recommendations are still not regularly followed 12 years after their publication.

    Disseny i avaluació d'un cluster HPC: aplicacions

    No full text
    This project presents an analysis of state-of-the-art HPC applications and studies possible optimization techniques. Finally, it evaluates a cluster built from mobile-phone technology.

    Optimizing the SpMV kernel on long-vector accelerators

    Get PDF
    Sparse Matrix-Vector multiplication (SpMV) is an essential kernel for parallel numerical applications. SpMV displays sparse and irregular data accesses, which complicate its vectorization. These difficulties frequently cause SpMV to deliver suboptimal results when run on long vector ISAs exploiting SIMD parallelism. In this context, the development of new optimizations becomes fundamental to enable high-performance SpMV executions on emerging long vector architectures. In our work, we improve the state-of-the-art SELL-C-s sparse matrix format by proposing several new optimizations for SpMV. We target aggressive long vector architectures like the NEC Vector Engine. By combining several optimizations, we obtain an average 12% improvement over SELL-C-s considering a heterogeneous set of 24 matrices. Our optimizations boost performance on long vector architectures since they expose a high degree of SIMD parallelism.
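    For readers unfamiliar with the format family, the following C sketch shows the core SpMV loop over a SELL-C-s style layout: rows are grouped into chunks of C, each chunk is zero-padded to the length of its longest row and stored column-major, so the innermost loop walks C rows with unit stride and maps naturally onto long vector registers. This illustrates the general technique only, not the authors' optimized code; the struct and field names are invented for the example.

        #include <stddef.h>

        /* Chunked sliced-ELL storage: each chunk of C rows is padded to the
           longest row in the chunk and laid out column-major. Padding slots
           hold a zero value and a valid (e.g., 0) column index. */
        typedef struct {
            int     n_rows;     /* number of matrix rows                    */
            int     C;          /* chunk size, matched to the vector length */
            int     n_chunks;   /* ceil(n_rows / C)                         */
            int    *chunk_ptr;  /* offset of each chunk in vals/cols        */
            int    *chunk_len;  /* padded row length of each chunk          */
            int    *cols;       /* column indices (column-major per chunk)  */
            double *vals;       /* nonzeros, zero-padded, same layout       */
        } sell_c_s;

        /* y = A * x. The loop over r is unit-stride across the chunk, so a
           vectorizing compiler can keep C partial sums in vector registers. */
        void spmv_sell_c_s(const sell_c_s *A, const double *x, double *y)
        {
            for (int i = 0; i < A->n_rows; ++i)
                y[i] = 0.0;
            for (int c = 0; c < A->n_chunks; ++c) {
                int base = A->chunk_ptr[c];
                int len  = A->chunk_len[c];
                int row0 = c * A->C;
                for (int j = 0; j < len; ++j) {
                    for (int r = 0; r < A->C && row0 + r < A->n_rows; ++r) {
                        int idx = base + j * A->C + r;
                        y[row0 + r] += A->vals[idx] * x[A->cols[idx]];
                    }
                }
            }
        }

    Choosing C equal to the hardware vector length (e.g., 256 doubles on the NEC Vector Engine) lets each column step of a chunk fill one vector register, which is why this layout suits long vector ISAs.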

    Efficiently running SpMV on long vector architectures

    No full text
    Sparse Matrix-Vector multiplication (SpMV) is an essential kernel for parallel numerical applications. SpMV displays sparse and irregular data accesses, which complicate its vectorization. These difficulties frequently cause SpMV to deliver suboptimal results when run on long vector ISAs exploiting SIMD parallelism. In this context, the development of new optimizations becomes fundamental to enable high-performance SpMV executions on emerging long vector architectures. In this paper, we improve the state-of-the-art SELL-C-s sparse matrix format by proposing several new optimizations for SpMV. We target aggressive long vector architectures like the NEC Vector Engine. By combining several optimizations, we obtain an average 12% improvement over SELL-C-s considering a heterogeneous set of 24 matrices. Our optimizations boost performance on long vector architectures since they expose a high degree of SIMD parallelism. The authors would like to acknowledge the support of NEC Corporation. This work is partially supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project and by the Generalitat de Catalunya (contract 2017-SGR-1414). Marc Casas has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship RYC-2017-23269.